Making a phylogenetic tree using #Invertefest observations

Author

Maxime Dahirel

Published

2023-09-02

#InverteFest is a periodic online event where we invite you to celebrate the overlooked invertebrate fauna around you and share the joys of discovery online.
The hashtag was conceived when Franz Anthony and Maureen Berg went looking for bugs and slugs in Bali. Kelly Brenner, stuck in Seattle, wondered if we could invite our online friends to look for bugs and slugs together in spirit, even though we’re physically far apart. Besides, what may be an everyday creature to you is often exciting to someone who lives half the world away!

(from the #Invertefest August 2023 page on iNaturalist)

During this summer’s (2023) #Invertefest, I randomly decided it would be neat to try and make a phylogenetic tree from all the iNat observations.

Code: load packages
library(tidyverse) # CRAN v2.0.0
library(httr)      # CRAN v1.4.7
library(ape)       # CRAN v5.7-1
library(rotl)      # CRAN v3.1.0
library(ggtree)    # Bioconductor v3.8.2
library(ggtext)    # CRAN v0.1.2
library(rphylopic) # CRAN v1.1.1 
library(showtext)  # CRAN v0.9-6  
library(here)      # CRAN v1.0.1

Loading iNat observations

There are in two main options to get the list of observed taxa. We can rely on data predownloaded manually through the website, or we can directly query the API via httr::GET().

Code: load data from pre-downloaded csv
#invertefest_summer23 <- read_csv(here("raw_data","observations-354487.csv"))
Code: get data by request to the iNat API, then parse the result
# first let's make a general request to see how many obs there is
request <- GET("https://api.inaturalist.org/v1/observations?project_id=invertefest-august-2023")
request_parsed <- content(request)
N_obs <- request_parsed$total_results
# this is the right number, compare with rinat which give a number that's way too high
# library(rinat)
# N_obs_rinat <- rinat::get_inat_obs_project("invertefest-august-2023",type="info")$taxa_count
#THIS is wrong

Npages <- ceiling(N_obs / 200) # with a max per page set at 200, this gives the number of page requests we're gonna need

prefix_url <- "https://api.inaturalist.org/v1/observations?project_id=invertefest-august-2023&page="
suffix_url <- "&per_page=200&order=desc&order_by=created_at"

invertefest_summer23 <- tibble(page = 1:Npages) |>
  mutate(url = paste0(prefix_url, page, suffix_url)) |>
  mutate(response = map(.x = url, .f = ~ .x |> GET())) |>
  mutate(parsed = map(.x = response, .f = ~ .x |>
    content() |>
    pluck("results"))) |> # we get the list containing the results for each page
  unnest(parsed) |>
  mutate(scientific_name = map(.x = parsed, .f = ~ .x |> pluck("taxon", "name"))) |> # we extract the taxon name
  unnest(scientific_name)
Note

In theory, we could use the rinat package to query the API instead, but it has some known issues; here for instance rinat would say the project has tens of 1000s of obs when it only had 1980 when this page was last updated.

Generating tree data

Now that we have our data, we are going to use the rotl package to interact with the Open Tree of Life:
- first to get Open Tree Taxonomy IDs from our taxon names
- then to get a subtree of the Open Tree of Life from these IDs

Code: get OTT IDs
taxa <- unique(invertefest_summer23$scientific_name)

resolved_names <- tnrs_match_names(taxa, context_name = "Animals")|>
  filter(!is.na(ott_id))

resolved_names$in_tree <- is_in_tree(resolved_names$ott_id) # takes a while
# filtering out flags likely to signal "not in tree" status (eg incertae sedis) works faster but may be more approximate/trial-and-error

filtered_names <- resolved_names |>
  filter(in_tree==TRUE & score > 0.9) # can adjust score
Taxons lost in processing

Not all names are successfully matched by tnrs_match_names(), some are matched but matches are low-quality and likely wrong, some are matched but the taxon is not in the synthetic tree and needs to be removed to avoid errors when generating our subtree. We still get 1081 OTT IDs out of 1223 unique names in the input, which is good enough for now.

Code: get the tree from OTOL
tol_induced_subtree(
  ott_ids = unique(filtered_names$ott_id),
  label_format = "name", file = here("tree_data", "tree.newick")
)

my_tree <- ape::read.tree(here("tree_data", "tree.newick"))

We save the tree in Newick format to file, and then reimport it immediately. We actually don’t need to do that in most cases, we could directly get the tree as a phylo object, by letting the file argument empty. But doing that tends to collapse a lot of internal nodes (which contains the clade names), and we’re going to use some of those to annotate the tree.

Plotting the tree

We use the ggtree package to plot and annotate the tree. We can use many types of annotations. We’re going to combine colour highlighting of major clades with PhyloPic images of some iconic groups, courtesy of the rphylopic package. We also make sure that the silhouettes are properly credited on the plot.

Code: prepare tree annotations
d1 <- data.frame(
  node = c(
    length(my_tree$tip.label) + which(my_tree$node.label == "Insecta"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Malacostraca"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Araneae"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Mollusca"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Myriapoda")
  ),
  taxon = c("Insects", "Malacostracans","Spiders", "Molluscs", "Myriapods")
)


d2 <- data.frame(
  node = c(
    length(my_tree$tip.label) + which(my_tree$node.label == "Diptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Hymenoptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Coleoptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Lepidoptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Orthoptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Araneae"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Gastropoda"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Odonata"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Hemiptera"),
    length(my_tree$tip.label) + which(my_tree$node.label == "Myriapoda")
  ),
  taxon = c(
    "Diptera", "Hymenoptera", "Coleoptera", "Lepidoptera",
    "Orthoptera", "Araneae", "Gastropoda", "Odonata", "Hemiptera", "Myriapoda"
  ),
  image = c(
    get_uuid(name = "Drosophila americana", n = 1),
    get_uuid(name = "Vespula", n = 1),
    get_uuid(name = "Carabus", n = 1),
    get_uuid(name = "Papilio", n = 1),
    get_uuid(name = "Acrididae", n = 1),
    get_uuid(name = "Araneus", n = 1),
    get_uuid(name = "Helix aspersa", n = 1),
    get_uuid(name = "Sympetrum", n = 1),
    get_uuid(name = "Lycorma delicatula", n = 1),
    get_uuid(name = "Lithobius forficatus", n = 1)
  )
) |> 
  mutate(contributor = map(.x=image,.f=~.x |> get_attribution() |> pluck("contributor")),
         license = map(.x=image,.f=~.x |> get_attribution() |> pluck("license"))
  ) |> 
  mutate(lic=str_remove(license,"https://creativecommons.org/")) |> 
  mutate(lic=case_when(str_detect(lic,"publicdomain")~"*",T~lic)) |> 
  mutate(lic=str_remove(lic,"licenses/"))|> 
  mutate(lic=str_replace_all(lic,"/$","")) |> 
  mutate(lic=str_replace_all(lic,"/"," ") |> 
           str_to_upper()) |> 
  mutate(lic=case_when(lic!="*"~paste("CC", lic),TRUE~"public domain")) |> 
  mutate(attr = paste0(taxon,": ", contributor, ", ", lic)) |> 
  arrange(taxon)

phylopic_credit <- paste0("Animal silhouettes are from **phylopic.org**. ",paste(d2$attr,collapse="; "),".")

figure_author <- "Data visualisation by **Maxime Dahirel**, underlying code: **github.com/mdahirel/invertefest_tree**"

#(tips are counted as nodes for `ggtree` purposes, that's why we add the number of tips to get the correct node values)
Code: create the tree
font_add_google(name="Open Sans",family="Open Sans")
font_add_google(name="Ubuntu",family="Ubuntu")

showtext_auto()

tree_line_width <- 0.35

p <- ggtree(my_tree, layout = "circular", size=tree_line_width) +
  geom_highlight(data = d1, aes(node = node, fill = taxon)) +
  geom_tree(size=tree_line_width) +
  geom_cladelab(
    data = d2,
    mapping = aes(node = node, label = taxon, image = image), imagesize = 0.1,
    geom = "phylopic", imagecolour = "black", offset = 1, offset.text = 10, alpha = 1
  ) +
  scale_fill_brewer(name= "**Highlighted taxa**",palette = "Dark2") + 
  labs(
    title="**How diversified are #InverteFest observations?**",
    subtitle="A phylogenetic tree based on **iNaturalist** data from August 2023",
    caption = paste0(figure_author,"<br/><br/>",phylopic_credit)
  )+
  theme(text=element_text(family = "Ubuntu", size = 12),
        plot.title=element_markdown(family = "Open Sans", size = 18),
        plot.subtitle=element_markdown(family = "Ubuntu"),
        legend.title=element_markdown(family = "Ubuntu"),
        #legend.position = "bottom",
        plot.caption = element_textbox_simple(family = "Ubuntu",size=8, halign=1),
        plot.caption.position = "plot"
        )

ggsave(here("tree_plots","tree.pdf"), plot = p, device=cairo_pdf)
ggsave(here("tree_plots","tree.svg"), plot = p)

knitr::include_graphics(here("tree_plots","tree.svg"))

Note

You may have noticed that there are fewer tips in the tree than we have “valid” OTT IDs (856 vs 1081, actually). This is because higher-level IDs will end up as internal nodes if there exists another observation from the same taxon but with a more precise ID (for instance, for a set of two observations, one identified as Cepaea sp. and one as Cepaea nemoralis, we would end up with only one tip in the tree, Cepaea nemoralis, with Cepaea as an internal node).